File Descriptors and I/O

The Incident: 3 AM, Zero Open Connections, Service Dead

In October 2022, a data pipeline service at a logistics company stopped accepting Kafka messages at 3:15 AM. Monitoring showed the service was healthy - CPU at 4%, memory normal, no errors in the application log. On-call engineers restarted the service and it recovered immediately. The same failure repeated for three nights.

The root cause: the service opened a new file descriptor while writing log records to a rotating log file, but never closed the old file descriptor after log rotation. After 9 hours of operation at ~100 records/second, the process had accumulated 1024 open FDs, which is the default per-process soft limit on open files (RLIMIT_NOFILE) on most Linux systems. From that point on, every open(), accept(), and socket() call failed with EMFILE: Too many open files. The Kafka consumer silently dropped messages because it could not open sockets to fetch new partitions.

Diagnosis: lsof -p PID | wc -l returned 1025. lsof -p PID | grep log | head -20 showed 900+ FDs pointing to the same log file path. The fix was a one-line file.close() in the log handler and a resource.setrlimit(resource.RLIMIT_NOFILE, (65536, 65536)) call at startup. The close() is the actual fix; raising the limit only adds headroom.
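
The limit change mentioned above can be sketched as follows. This is a minimal illustration, not the incident's actual code: the helper name raise_nofile_soft_limit and the default target are assumptions. An unprivileged process can raise its soft limit only up to the hard limit, so the sketch clamps the target and never lowers an existing limit.

```python
import resource


def raise_nofile_soft_limit(target: int = 65536) -> tuple[int, int]:
    """Raise the soft RLIMIT_NOFILE toward `target` without exceeding the hard limit.

    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # RLIM_INFINITY means "no cap"; otherwise the hard limit is the ceiling.
    ceiling = target if hard == resource.RLIM_INFINITY else min(target, hard)

    # Never lower a limit that is already high enough.
    if soft == resource.RLIM_INFINITY or soft >= ceiling:
        return soft, hard

    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (ceiling, hard))
    except (ValueError, OSError):
        # Some platforms reject values above a kernel maximum; keep the old limit.
        pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = raise_nofile_soft_limit()
print(f"RLIMIT_NOFILE soft={soft} hard={hard}")
```

Calling this once at startup, before any worker threads or child processes are created, keeps the limit consistent across the whole service.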

This lesson explains every layer of that failure.

What Is a File Descriptor?

A file descriptor (FD) is a small non-negative integer returned by the kernel to represent an open file, socket, pipe end, device, or other I/O endpoint. The integer is an index into the per-process file descriptor table, which the kernel maintains for each process.

Per-process FD table (in-kernel)          Open file table (system-wide)
─────────────────────────────────         ──────────────────────────────────

fd 0 ─────────────────────────────────► entry: /dev/pts/0, offset=0, flags=O_RDWR
fd 1 ─────────────────────────────────► entry: /dev/pts/0, offset=0, flags=O_RDWR
fd 2 ─────────────────────────────────► entry: /dev/pts/0, offset=0, flags=O_RDWR
fd 3 ─────────────────────────────────► entry: /var/log/app.log, offset=18432, flags=O_WRONLY
fd 4 ─────────────────────────────────► entry: socket:[tcp,0.0.0.0:8080], state=LISTEN
fd 5 ─────────────────────────────────► entry: socket:[tcp,10.0.0.1:8080->client:49201]
fd 6 ─────────────────────────────────► entry: pipe:[99031], read end
fd 7 ─────────────────────────────────► entry: pipe:[99031], write end
                                                  │
                                                  ▼
                                        Inode (VFS abstraction over filesystem,
                                        socket, pipe, device - all unified)

Key insight: the integers 0, 1, 2 are not special to the kernel - they are conventional. Stdin is FD 0 because the shell opens the terminal and passes FD 0 to children. You can make FD 0 a file, a pipe, or a socket - shells redirect stdin this way.

Multiple FDs can reference the same open file description (they share the same seek offset and status flags). This happens with dup() and with fork() (child inherits all parent FDs, pointing to the same kernel file descriptions).
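
The shared-offset behavior can be observed directly. The sketch below is illustrative only (the function name offsets_after_read is made up for this example): it reads through one FD and then inspects the offset seen through a dup()ed FD and through an independent open() of the same file.

```python
import os
import tempfile


def offsets_after_read() -> tuple[int, int]:
    """Return (offset via a dup()ed FD, offset via a second open()) after a 4-byte read."""
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"0123456789")
        tmp.flush()

        fd = os.open(tmp.name, os.O_RDONLY)
        dup_fd = os.dup(fd)                        # same open file description as fd
        other_fd = os.open(tmp.name, os.O_RDONLY)  # separate open file description
        try:
            os.read(fd, 4)  # advances the offset stored in the shared description
            shared = os.lseek(dup_fd, 0, os.SEEK_CUR)
            independent = os.lseek(other_fd, 0, os.SEEK_CUR)
        finally:
            for open_fd in (fd, dup_fd, other_fd):
                os.close(open_fd)
    return shared, independent


print(offsets_after_read())  # (4, 0)
```

The dup()ed FD sees offset 4 because it shares the description with fd; the separately opened FD still sits at offset 0.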

os.open() vs open(): POSIX Flags

Python's open() is a high-level wrapper that returns a buffered file object. os.open() is a thin wrapper around the POSIX open(2) syscall and returns a bare integer FD. Use os.open() when you need flags that open() doesn't expose.

import os

# open() - high-level, returns a file object with buffering
f = open("/tmp/test.txt", "w")
f.write("hello\n")
f.close()

# os.open() - direct syscall, returns an integer FD
# Flags combine with bitwise OR
fd = os.open(
    "/tmp/test_raw.txt",
    os.O_WRONLY | os.O_CREAT | os.O_TRUNC,  # create/truncate for writing
    0o644,  # permission bits (owner rw, group r, other r)
)
os.write(fd, b"hello raw\n")
os.close(fd)
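
The two APIs also combine: os.fdopen() wraps a raw FD in a regular file object, so POSIX flags and buffered writes can be used together. A minimal sketch, assuming a hypothetical helper named write_private_file:

```python
import os
import stat
import tempfile


def write_private_file(path: str, text: str) -> None:
    """Create `path` exclusively with mode 0600, then write through a buffered file object."""
    # O_EXCL makes creation fail with FileExistsError instead of overwriting.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    # os.fdopen() takes ownership of fd: closing the file object closes the FD.
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(text)


with tempfile.TemporaryDirectory() as tmpdir:
    target = os.path.join(tmpdir, "secret.txt")
    write_private_file(target, "token=example\n")
    print(oct(stat.S_IMODE(os.stat(target).st_mode)))
```

Because the file object owns the FD, there is no separate os.close() call to forget.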

The O_ Flags Reference

import os

# Access mode flags (exactly one required):
# O_RDONLY open for reading only
# O_WRONLY open for writing only
# O_RDWR open for reading and writing

# Creation flags:
# O_CREAT create the file if it doesn't exist (requires mode argument)
# O_EXCL with O_CREAT: fail if file exists (atomic test-and-create)
# O_TRUNC if file exists: truncate to length 0 before opening

# Write behavior flags:
# O_APPEND all writes atomically seek to end-of-file first
# ESSENTIAL for log files written by multiple processes/threads
# O_SYNC wait for physical storage write to complete before returning
# Use for write-ahead logs where durability guarantees are required
# O_DSYNC like O_SYNC but only waits for data (not metadata) persistence

# I/O behavior flags:
# O_NONBLOCK open in non-blocking mode; read/write return EAGAIN instead of blocking
# O_CLOEXEC set FD_CLOEXEC atomically at open time (no window in which a concurrent fork+exec can leak the FD)
# O_DIRECT bypass page cache, I/O goes directly to storage
# requires aligned buffers; used by databases that manage their own buffer cache (e.g. MySQL/InnoDB)

# Practical examples (paths are illustrative; writing to /var/log usually requires root):

# Atomic log file write - multiple processes/threads safe
log_fd = os.open(
    "/var/log/app.log",
    os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_CLOEXEC,
    0o644,
)
try:
    os.write(log_fd, b"2026-03-07 ERROR something bad happened\n")
finally:
    os.close(log_fd)  # always close: a leaked log FD is exactly the incident above

# Atomic file creation - fails if file exists (prevents race conditions)
try:
    lock_fd = os.open(
        "/tmp/app.lock",
        os.O_WRONLY | os.O_CREAT | os.O_EXCL | os.O_CLOEXEC,
        0o600,
    )
    os.write(lock_fd, str(os.getpid()).encode())
    os.close(lock_fd)
    print("Lock acquired")
except FileExistsError:
    print("Another instance is running")

# Non-blocking FIFO open - a plain O_RDONLY open would block until a writer appears
if not os.path.exists("/tmp/myfifo"):
    os.mkfifo("/tmp/myfifo")
fifo_fd = os.open("/tmp/myfifo", os.O_RDONLY | os.O_NONBLOCK)
os.close(fifo_fd)

File Descriptor Inheritance and FD_CLOEXEC

When a process calls fork(), the child inherits all open file descriptors. When a process calls exec(), by default it also inherits all open FDs - unless those FDs have FD_CLOEXEC set, which causes them to be automatically closed upon exec().

import fcntl
import os
import socket
import subprocess

# The FD leak problem ("db.internal" is an illustrative host name):
sock = socket.socket()
sock.connect(("db.internal", 5432))
print(f"Database socket FD: {sock.fileno()}")

# Since Python 3.4 (PEP 446), FDs created by Python are non-inheritable by default,
# so the leak only appears for FDs marked inheritable or opened by C extensions.
sock.set_inheritable(True)

# With an inheritable FD and close_fds=False, this subprocess inherits our database socket
proc = subprocess.Popen(["ls", "/tmp"], stdout=subprocess.PIPE, close_fds=False)
# Now 'ls' has an open FD to our database! Unexpected behavior.
proc.communicate()  # communicate() drains the pipe; wait() alone can deadlock on large output

# Fix 1: close_fds=True (default) - closes all FDs > 2 in child before exec
proc2 = subprocess.Popen(["ls", "/tmp"], stdout=subprocess.PIPE, close_fds=True)
proc2.communicate()

# Fix 2: Set FD_CLOEXEC on FDs that fork()ed children may share but exec()ed programs must not see
def set_cloexec(fd: int) -> None:
    """Mark fd as close-on-exec."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)

set_cloexec(sock.fileno())
sock.close()

# Fix 3: Use O_CLOEXEC at open time (no race condition with fork+exec)
fd = os.open("/tmp/safe.txt", os.O_RDONLY | os.O_CREAT | os.O_CLOEXEC, 0o644)
os.close(fd)
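
Python also exposes the close-on-exec bit through os.get_inheritable() and os.set_inheritable(). The sketch below is an illustration under the assumption of Python 3.4 or later; the helper names cloexec_is_set and inheritance_states are invented for this example. It shows that the two views describe the same kernel flag.

```python
import fcntl
import os


def cloexec_is_set(fd: int) -> bool:
    """Return True if FD_CLOEXEC is set on fd."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)


def inheritance_states() -> list[tuple[bool, bool]]:
    """Record (inheritable, cloexec) for a pipe FD before and after set_inheritable(True)."""
    read_fd, write_fd = os.pipe()  # created non-inheritable by Python
    try:
        states = [(os.get_inheritable(read_fd), cloexec_is_set(read_fd))]
        os.set_inheritable(read_fd, True)  # clears FD_CLOEXEC
        states.append((os.get_inheritable(read_fd), cloexec_is_set(read_fd)))
        return states
    finally:
        os.close(read_fd)
        os.close(write_fd)


print(inheritance_states())  # [(False, True), (True, False)]
```

When a child really does need one specific FD, subprocess.Popen(..., pass_fds=(fd,)) keeps that FD open in the child while close_fds still closes everything else.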

Duplicating File Descriptors: dup() and dup2()

os.dup(fd) creates a new FD that refers to the same open file description as fd. The new FD gets the lowest available FD number. os.dup2(fd, newfd) duplicates fd to a specific target number, closing newfd first if it was open. This is how shell I/O redirection is implemented.

import os
import sys

# Shell redirection "python script.py > output.txt" implemented manually:

# Open the output file
out_fd = os.open("/tmp/redirect_test.txt", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)

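# Flush Python's own stdout buffer first so pending output is not redirected too
sys.stdout.flush()

# Save the original stdout FD
original_stdout_fd = os.dup(1)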

# Replace FD 1 (stdout) with our file
os.dup2(out_fd, 1)
os.close(out_fd) # out_fd is now redundant; FD 1 points to the file

# Now any write to FD 1 goes to /tmp/redirect_test.txt
os.write(1, b"This goes to the file\n")
sys.stdout.write("This also goes to the file (stdout)\n")
sys.stdout.flush()

# Restore original stdout
os.dup2(original_stdout_fd, 1)
os.close(original_stdout_fd)

# Verify
print("This prints to the terminal again")


# Pipe redirection: "cmd1 | cmd2" in Python
def pipe_demo():
    read_fd, write_fd = os.pipe()

    pid = os.fork()
    if pid == 0:
        # Child: stdout -> write end of pipe
        try:
            os.dup2(write_fd, 1)
            os.close(read_fd)
            os.close(write_fd)
            os.execvp("ls", ["ls", "-la", "/tmp"])
        finally:
            # execvp() only returns by raising; never fall back into the parent's code
            os._exit(1)
    else:
        # Parent: read from pipe until the child closes its end (EOF)
        os.close(write_fd)
        output = b""
        while True:
            chunk = os.read(read_fd, 4096)
            if not chunk:
                break
            output += chunk
        os.close(read_fd)
        os.waitpid(pid, 0)
        print(f"Pipe captured {len(output)} bytes of ls output")
        print(output[:200].decode(errors="replace"))

pipe_demo()

The io Module Internals: Buffering Stack

Python's open() returns a layered object. Understanding the layers explains flush behavior, partial reads, and performance.

open("file.txt", "r")
        │
        ▼
TextIOWrapper         ← text encoding/decoding, universal newlines
        │               wraps a BufferedReader
        ▼
BufferedReader        ← userspace read buffer (default 8192 bytes)
        │               reads large chunks from raw, serves small reads
        ▼
FileIO (RawIOBase)    ← calls os.read(fd, n) directly
        │
        ▼
Kernel FD table entry
        │
        ▼
Page cache (kernel)   ← 4 KB pages cached in RAM
        │
        ▼
Physical storage

import io
import os

# Access the underlying layers
with open("/etc/hosts", "r") as f:
    print(type(f))                # <class '_io.TextIOWrapper'>
    print(type(f.buffer))         # <class '_io.BufferedReader'>
    print(type(f.buffer.raw))     # <class '_io.FileIO'>
    print(f.buffer.raw.fileno())  # the integer FD

    # Buffer size for buffered I/O: open() uses the file's block size when it can
    # determine it and falls back to io.DEFAULT_BUFFER_SIZE (commonly 8192)
    print(io.DEFAULT_BUFFER_SIZE)

# Binary buffered I/O
with open("/tmp/data.bin", "wb") as f:
    print(type(f))  # <class '_io.BufferedWriter'>
    # write() only copies into the userspace buffer. Data reaches the kernel when:
    # 1. Buffer fills (8192 bytes default)
    # 2. f.flush() is called
    # 3. f.close() is called
    f.write(b"x" * 1000)
    f.flush()             # pushes buffer to kernel page cache
    os.fsync(f.fileno())  # waits for kernel to write to physical storage

# Raw unbuffered I/O - reads/writes go directly to the kernel
with open("/tmp/raw.bin", "wb", buffering=0) as f:
    print(type(f))        # <class '_io.FileIO'>
    f.write(b"x" * 1000)  # directly calls write(2) syscall

# BytesIO: in-memory file - same interface, no syscalls
buf = io.BytesIO()
buf.write(b"hello ")
buf.write(b"world")
buf.seek(0)
print(buf.read())  # b'hello world'

Flushing vs fsync

import os

# f.flush() moves data from Python's userspace buffer to the kernel page cache
# The kernel may hold it in RAM for seconds before writing to disk
with open("/tmp/important.txt", "w") as f:
    f.write("critical data\n")
    f.flush()  # data is in kernel now, but NOT necessarily on disk

# os.fsync(fd) tells the kernel to write the page cache to physical storage
# Blocks until the drive confirms the write
with open("/tmp/important.txt", "w") as f:
    f.write("critical data\n")
    f.flush()
    os.fsync(f.fileno())  # durable once this returns

# os.fdatasync(fd): like fsync but skips metadata that is not needed to read
# the data back (timestamps); a changed file size is still flushed.
# Faster than fsync; sufficient for data durability. Not available on macOS.
with open("/tmp/important.txt", "w") as f:
    f.write("critical data\n")
    f.flush()
    os.fdatasync(f.fileno())
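
A common way to combine flush() and fsync() is the write-then-rename pattern: write a temporary file, make it durable, then atomically swap it into place. The sketch below is one possible shape of that pattern, not a definitive implementation; the function name atomic_write and the ".tmp-" prefix are assumptions for illustration.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers see either the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory so os.replace() stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()             # userspace buffer -> page cache
            os.fsync(f.fileno())  # page cache -> storage
        os.replace(tmp_path, path)  # atomic rename over the target
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise

    # fsync the directory so the rename itself survives a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Note that mkstemp() creates the file with mode 0600, so a caller that needs different permissions would have to chmod the temp file before the rename.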

select, poll, and epoll Compared

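| Feature | select | poll | epoll |
|---|---|---|---|
| FD limit | 1024 (FD_SETSIZE) | No fixed limit (bounded by RLIMIT_NOFILE) | No fixed limit (bounded by RLIMIT_NOFILE) |
| Complexity | O(n) per call | O(n) per call | O(1) per ready event |
| Data copy | Copies FD set kernel↔user on every call | Copies pollfd array on every call | epoll_ctl adds FD once |
| Platform | All POSIX | All POSIX (not efficient on macOS) | Linux only |
| Level/edge | Level-triggered only | Level-triggered only | Both |
| Best for | < 100 FDs, portability | < 10K FDs | > 10K FDs, production servers |
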
import select
import socket
import time

def benchmark_select_vs_epoll(n_sockets: int = 100, n_calls: int = 10_000):
    """
    Create n idle sockets and measure how long select and epoll take to poll
    them with no events pending: select scans all n FDs on every call, while
    epoll only inspects its ready list.

    Each socketpair uses two FDs and select() rejects FD numbers >= 1024
    (FD_SETSIZE), so keep n_sockets below ~500 under default limits.
    """
    # Create n connected loopback sockets
    pairs = []
    for _ in range(n_sockets):
        a, b = socket.socketpair()
        a.setblocking(False)
        pairs.append((a, b))

    read_fds = [a for a, b in pairs]

    try:
        # Benchmark select
        t0 = time.perf_counter()
        for _ in range(n_calls):
            select.select(read_fds, [], [], 0)
        select_time = time.perf_counter() - t0

        print(f"n_sockets={n_sockets}")
        print(f"  select: {select_time*1000:.1f}ms total, {select_time/n_calls*1e6:.1f}μs/call")

        # Benchmark epoll (Linux only)
        if hasattr(select, "epoll"):
            ep = select.epoll()
            for sock in read_fds:
                ep.register(sock.fileno(), select.EPOLLIN)

            t0 = time.perf_counter()
            for _ in range(n_calls):
                ep.poll(timeout=0)
            epoll_time = time.perf_counter() - t0
            ep.close()

            print(f"  epoll: {epoll_time*1000:.1f}ms total, {epoll_time/n_calls*1e6:.1f}μs/call")
            print(f"  speedup: {select_time/epoll_time:.1f}x")
    finally:
        for a, b in pairs:
            a.close()
            b.close()

benchmark_select_vs_epoll(100)
benchmark_select_vs_epoll(400)
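
For portable code, the standard library's selectors module picks the best available mechanism (epoll on Linux, kqueue on macOS/BSD, select as a last resort) behind one interface. A small sketch follows; the function name wait_for_readable and the "first"/"second" labels are illustrative assumptions.

```python
import selectors
import socket


def wait_for_readable(timeout: float = 1.0) -> list[str]:
    """Register two sockets, make one readable, and return the labels that are ready."""
    sel = selectors.DefaultSelector()
    a1, b1 = socket.socketpair()
    a2, b2 = socket.socketpair()
    try:
        # `data` is an arbitrary object attached to the registration.
        sel.register(a1, selectors.EVENT_READ, data="first")
        sel.register(a2, selectors.EVENT_READ, data="second")

        b2.send(b"ping")  # only a2 now has data waiting

        return [key.data for key, _events in sel.select(timeout)]
    finally:
        sel.close()
        for sock in (a1, b1, a2, b2):
            sock.close()


print(wait_for_readable())  # ['second']
```

Only the socket whose peer sent data is reported; the idle one costs nothing beyond its registration.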

Concurrent File Reader with epoll

import os
import select

def _drain(fd: int) -> bytes:
    """Read fd until EOF, then close it."""
    chunks = []
    try:
        while chunk := os.read(fd, 65536):
            chunks.append(chunk)
    finally:
        os.close(fd)
    return b"".join(chunks)


def read_files_concurrently_epoll(paths: list[str]) -> dict[str, bytes]:
    """
    Read multiple paths through a single epoll loop.

    epoll only accepts pollable FDs (pipes, FIFOs, sockets, character devices).
    Regular files are rejected: epoll_ctl() fails with EPERM, which Python
    raises as PermissionError. select() and poll() accept regular files but
    always report them ready, so they add no concurrency either. Regular
    files are therefore read directly here; truly asynchronous file I/O
    needs io_uring or a thread pool.
    """
    results: dict[str, bytes] = {}

    if not hasattr(select, "epoll"):
        # macOS/BSD: no epoll, read each path sequentially
        for path in paths:
            results[path] = _drain(os.open(path, os.O_RDONLY))
        return results

    fds: dict[int, str] = {}
    ep = select.epoll()
    try:
        for path in paths:
            fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
            try:
                ep.register(fd, select.EPOLLIN)
            except PermissionError:
                results[path] = _drain(fd)  # regular file: EPERM from epoll_ctl()
            else:
                fds[fd] = path
                results[path] = b""

        while fds:
            for fd, event in ep.poll(timeout=1.0):
                if event & (select.EPOLLIN | select.EPOLLHUP | select.EPOLLERR):
                    chunk = os.read(fd, 65536)
                    if chunk:
                        results[fds[fd]] += chunk
                    else:
                        # EOF: stop watching this FD and close it
                        ep.unregister(fd)
                        os.close(fd)
                        del fds[fd]
    finally:
        ep.close()
        for fd in fds:
            os.close(fd)

    return results

contents = read_files_concurrently_epoll(["/etc/hosts", "/etc/hostname", "/etc/os-release"])
for path, data in contents.items():
    print(f"{path}: {len(data)} bytes")

mmap for Zero-Copy File Access

mmap maps a file (or anonymous memory region) directly into the process's virtual address space. Reading or writing the mapped region accesses the file data directly through the kernel's page cache - no read()/write() syscalls, no data copies between user space and kernel space.

Without mmap:                              With mmap:

open() -> fd                               mmap() -> ptr
    │                                          │
f.read(n)                                  data = ptr[0:n]
    │                                          │
    ▼                                          ▼
kernel copies page cache ──► user buf      CPU accesses page cache directly
(one copy)                                 (zero copies after page fault)

import mmap
import os

# Create a test file
with open("/tmp/mmap_test.bin", "wb") as f:
f.write(b"A" * 1024 * 1024) # 1 MB

# Read a file using mmap
with open("/tmp/mmap_test.bin", "r+b") as f:
    # Map the entire file into memory (length 0 = whole file)
    # access=ACCESS_READ:  read-only mapping
    # access=ACCESS_WRITE: shared mapping, changes written back to the file (MAP_SHARED)
    # access=ACCESS_COPY:  copy-on-write, changes not written back (MAP_PRIVATE)
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        print(f"Mapped {len(mm)} bytes")
        print(f"First 10 bytes: {mm[:10]}")

        # Random access - O(1), no seek required
        print(f"Byte at offset 500000: {mm[500000:500001]}")

        # Search within the mapped region, starting at offset 100
        idx = mm.find(b"AAAA", 100)
        print(f"Pattern found at offset: {idx}")

# Read-write mmap: modify file without explicit write calls
with open("/tmp/mmap_rw.bin", "w+b") as f:
    f.write(b"\x00" * 4096)
    f.flush()

    with mmap.mmap(f.fileno(), 4096, access=mmap.ACCESS_WRITE) as mm:
        mm[0:5] = b"hello"
        mm[100:105] = b"world"
        # Changes are written back to the file by the kernel on its own schedule;
        # mm.flush() forces the writeback now
        mm.flush()

# Anonymous mmap - backed by swap, not a file; useful for shared memory between fork()
anon_mm = mmap.mmap(-1, 65536) # -1 = no file descriptor (anonymous)
anon_mm[0:5] = b"hello"
print(f"Anonymous mmap: {anon_mm[:5]}")
anon_mm.close()
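
The shared-memory use of anonymous mappings mentioned above can be sketched like this. It is a Unix-only illustration that relies on os.fork(); the function name sum_in_child and the 8-byte little-endian layout are assumptions made for the example.

```python
import mmap
import os
import struct


def sum_in_child(values: list[int]) -> int:
    """Compute sum(values) in a forked child and pass the result back through shared memory."""
    # Anonymous mappings are MAP_SHARED by default on Unix, so the region
    # stays shared between parent and child after fork().
    shared = mmap.mmap(-1, mmap.PAGESIZE)
    try:
        pid = os.fork()
        if pid == 0:
            # Child: write the result into the shared page, then exit immediately.
            try:
                shared[0:8] = struct.pack("<q", sum(values))
            finally:
                os._exit(0)

        # Parent: wait for the child, then read what it wrote.
        os.waitpid(pid, 0)
        (total,) = struct.unpack("<q", shared[0:8])
        return total
    finally:
        shared.close()


print(sum_in_child([1, 2, 3]))  # 6
```

Waiting for the child before reading is the only synchronization here; anything more concurrent would need a lock or a pipe for signaling.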

Memory-Mapping NumPy Arrays for Fast I/O

import numpy as np
import os

# Write a large array to disk
arr = np.arange(10_000_000, dtype=np.float64)
arr.tofile("/tmp/large_array.bin")

# numpy.memmap: backed by mmap - data loaded on demand, never all in RAM at once
# Perfect for arrays larger than available RAM
mm_arr = np.memmap("/tmp/large_array.bin", dtype=np.float64, mode="r")
print(f"Shape: {mm_arr.shape}")
print(f"Sum of first 1000: {mm_arr[:1000].sum():.0f}") # only first pages loaded

# Compare performance
import time

# Standard read: copies all 80 MB from kernel to user space, then sums
t0 = time.perf_counter()
raw = np.fromfile("/tmp/large_array.bin", dtype=np.float64)
_ = raw.sum()  # touch every element so both paths do the same work
read_time = time.perf_counter() - t0

# mmap read: zero copy, on-demand page faults
t0 = time.perf_counter()
mapped = np.memmap("/tmp/large_array.bin", dtype=np.float64, mode="r")
_ = mapped.sum()  # access triggers page faults
mmap_time = time.perf_counter() - t0

print(f"np.fromfile: {read_time*1000:.1f}ms")
print(f"np.memmap: {mmap_time*1000:.1f}ms")

os.sendfile(): Zero-Copy File Transfer

sendfile(2) transfers data directly from a file descriptor to a socket in the kernel, without the data ever passing through user space. This is how nginx serves static files so efficiently.

Without sendfile:                       With sendfile:

fd = open(file)                         fd = open(file)
data = read(fd, buf)   ← copy 1         sendfile(sock_fd, fd, offset, count)
write(sock_fd, buf)    ← copy 2             │
                                            ▼
                                        Kernel: DMA from storage ──► socket buffer
                                        Zero user-space copies

import os
import socket
import threading
import time

def serve_file_with_sendfile(filepath: str, host: str = "127.0.0.1", port: int = 9100):
    """Serve a single file over TCP using os.sendfile (Linux/macOS)."""

    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)

    conn, addr = server.accept()
    try:
        with open(filepath, "rb") as f:
            file_fd = f.fileno()
            file_size = os.path.getsize(filepath)
            sock_fd = conn.fileno()

            sent = 0
            offset = 0
            chunk_size = 1024 * 1024  # 1 MB chunks

            t0 = time.perf_counter()
            while sent < file_size:
                to_send = min(chunk_size, file_size - sent)
                # sendfile may transfer fewer bytes than requested; loop until done
                n = os.sendfile(sock_fd, file_fd, offset, to_send)
                if n == 0:
                    break
                sent += n
                offset += n

            elapsed = time.perf_counter() - t0
            throughput = sent / elapsed / (1024 * 1024)
            print(f"sendfile: sent {sent} bytes in {elapsed*1000:.1f}ms ({throughput:.0f} MB/s)")
    finally:
        conn.close()
        server.close()


def download_file(host: str, port: int, save_path: str):
    """Download the file sent by serve_file_with_sendfile."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((host, port))
    with open(save_path, "wb") as f:
        while True:
            chunk = sock.recv(65536)
            if not chunk:
                break
            f.write(chunk)
    sock.close()
    print(f"Downloaded to {save_path}: {os.path.getsize(save_path)} bytes")


# Create a test file and demo sendfile
os.makedirs("/tmp/sendfile_test", exist_ok=True)
test_file = "/tmp/sendfile_test/data.bin"
with open(test_file, "wb") as f:
    f.write(os.urandom(50 * 1024 * 1024))  # 50 MB random data

server_thread = threading.Thread(
    target=serve_file_with_sendfile,
    args=(test_file,),
    daemon=True,
)
server_thread.start()
time.sleep(0.1)  # crude wait for the server to start listening
download_file("127.0.0.1", 9100, "/tmp/sendfile_test/received.bin")
server_thread.join(timeout=5)  # let the server print its stats before exit
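
For most application code, the higher-level socket.sendfile() method is a simpler entry point: it uses os.sendfile() where the platform supports it and falls back to plain send() otherwise. The sketch below is illustrative (the function name send_small_file is an assumption) and keeps the payload small so that it fits in the socket buffer without a concurrent reader.

```python
import os
import socket
import tempfile


def send_small_file(payload: bytes) -> bytes:
    """Send a temp file's contents over a socketpair with socket.sendfile() and return what arrives."""
    sender, receiver = socket.socketpair()  # connected stream sockets
    try:
        with tempfile.TemporaryFile() as f:
            f.write(payload)
            f.flush()
            f.seek(0)
            # The file must be opened in binary mode; returns the number of bytes sent.
            sender.sendfile(f)
        sender.close()  # signals EOF to the receiver

        chunks = []
        while chunk := receiver.recv(65536):
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        sender.close()
        receiver.close()


data = os.urandom(4096)
print(send_small_file(data) == data)  # True
```

With larger payloads the receiver has to drain the socket concurrently, as the threaded server and client above do.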

/proc/self/fd: Detecting FD Leaks

import os
import subprocess

def count_open_fds(pid: str = "self") -> int:
    """Count open file descriptors for a process."""
    try:
        # For our own process the count includes the FD listdir() itself opens;
        # the offset is constant, so before/after differences are still exact.
        return len(os.listdir(f"/proc/{pid}/fd"))
    except (PermissionError, FileNotFoundError):
        # macOS fallback: no /proc, use lsof
        target = os.getpid() if pid == "self" else pid
        result = subprocess.run(
            ["lsof", "-p", str(target), "-F", "f"],
            capture_output=True, text=True
        )
        # With -F f each descriptor is printed on its own line as "f<number>";
        # entries such as "fcwd" or "ftxt" are not numeric FDs.
        return sum(
            1 for line in result.stdout.splitlines()
            if line.startswith("f") and line[1:].isdigit()
        )


def find_fd_leaks():
    """Detect FD leaks by comparing before/after a code block."""
    before = count_open_fds()
    print(f"FDs before: {before}")

    # Code under test - this leaks a file descriptor
    leaked_fds = []
    for _ in range(10):
        fd = os.open("/tmp/leak_test.txt", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        leaked_fds.append(fd)  # intentionally not closing!

    after = count_open_fds()
    print(f"FDs after: {after}")
    print(f"Leaked: {after - before} FDs")

    # Fix: close the leaked FDs
    for fd in leaked_fds:
        os.close(fd)

    print(f"FDs after cleanup: {count_open_fds()}")


find_fd_leaks()
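
Counting FDs shows that a leak exists; resolving the /proc/self/fd symlinks shows what is leaking, which is how the incident's 900+ FDs on one log path would stand out. The sketch below is Linux-only, and the helper names describe_open_fds and top_fd_targets are assumptions for illustration.

```python
import collections
import os
import tempfile


def describe_open_fds() -> dict[int, str]:
    """Map each open FD of this process to the target of its /proc/self/fd symlink."""
    targets = {}
    for name in os.listdir("/proc/self/fd"):
        try:
            targets[int(name)] = os.readlink(f"/proc/self/fd/{name}")
        except FileNotFoundError:
            # The FD that listdir() used to scan the directory is already closed.
            continue
    return targets


def top_fd_targets(n: int = 5) -> list[tuple[str, int]]:
    """Return the n most frequently opened targets with their FD counts."""
    return collections.Counter(describe_open_fds().values()).most_common(n)


if os.path.isdir("/proc/self/fd"):
    for target, count in top_fd_targets():
        print(f"{count:4d}  {target}")
```

A path that appears hundreds of times in this output is a strong hint that some code path opens it repeatedly without closing it.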

fcntl Module: File Control

The fcntl module exposes the fcntl(2) and ioctl(2) system calls, used for non-blocking mode, FD flags, and file locking.

import fcntl
import os

# Get and set file descriptor flags
fd = os.open("/tmp/test.txt", os.O_RDWR | os.O_CREAT, 0o644)

# F_GETFD: get file descriptor flags (FD_CLOEXEC)
fd_flags = fcntl.fcntl(fd, fcntl.F_GETFD)
print(f"FD flags: {fd_flags}")

# F_SETFD: set FD_CLOEXEC
fcntl.fcntl(fd, fcntl.F_SETFD, fd_flags | fcntl.FD_CLOEXEC)

# F_GETFL: get file status flags (O_RDONLY, O_WRONLY, O_NONBLOCK, etc.)
fl_flags = fcntl.fcntl(fd, fcntl.F_GETFL)
print(f"File flags: {fl_flags}")

# Add O_NONBLOCK to an already-open file
fcntl.fcntl(fd, fcntl.F_SETFL, fl_flags | os.O_NONBLOCK)

os.close(fd)

# File locking with flock
# Useful for advisory locking between cooperating processes
# (not enforced by the kernel for non-cooperative processes)

def acquire_file_lock(lockfile: str, exclusive: bool = True) -> int:
    """
    Acquire an advisory file lock.
    Returns the FD - keep it open while the lock is held.
    Close the FD to release the lock.
    """
    fd = os.open(lockfile, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        lock_type = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
        # LOCK_NB: non-blocking - raises BlockingIOError if lock not available
        fcntl.flock(fd, lock_type | fcntl.LOCK_NB)
        return fd
    except BlockingIOError:
        os.close(fd)
        raise RuntimeError(f"Could not acquire lock on {lockfile}")


def release_file_lock(fd: int) -> None:
    """Release a file lock and close the FD."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)


try:
    lock_fd = acquire_file_lock("/tmp/myapp.lock")
    print("Lock acquired - doing exclusive work")
    # ... critical section ...
    release_file_lock(lock_fd)
    print("Lock released")
except RuntimeError as e:
    print(f"Lock failed: {e}")
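
The acquire/release pair maps naturally onto a context manager, which guarantees that the FD is closed (and the lock released) even if the critical section raises. This is a sketch under the assumption of flock() semantics on Linux or BSD; the name file_lock is made up for the example.

```python
import contextlib
import fcntl
import os
import tempfile


@contextlib.contextmanager
def file_lock(path: str):
    """Hold an exclusive advisory flock on `path` for the duration of the with-block."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # Raises BlockingIOError immediately if another open file description holds the lock.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield fd
    finally:
        # Closing the only FD for this open file description releases the lock.
        os.close(fd)


with tempfile.TemporaryDirectory() as tmpdir:
    lock_path = os.path.join(tmpdir, "demo.lock")
    with file_lock(lock_path):
        try:
            with file_lock(lock_path):
                print("unexpected: second lock acquired")
        except BlockingIOError:
            print("second acquisition refused while the first is held")
```

flock() locks belong to the open file description, not to the process, which is why a second os.open() of the same path conflicts even inside one process.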

Interview Q&A

Q1: What is the difference between a file descriptor and a file description, and what happens to both during fork()?

A file descriptor (FD) is an integer index into the process's per-process FD table - it is process-local and has no meaning outside that process. A file description (or "open file description") is a kernel-maintained object in the system-wide open file table that holds the current file offset, status flags (O_APPEND, O_NONBLOCK, etc.), and a reference to the inode. Multiple FDs can reference the same file description - they share the offset and flags.

During fork(), the child gets a copy of the parent's FD table - the same FD integers pointing to the same file descriptions in the kernel's open file table. The reference count on each file description is incremented. If the parent has FD 3 pointing to a file at offset 1000, the child also has FD 3 pointing to the same file description at offset 1000. If the child then reads 100 bytes, the shared offset advances to 1100 - and the parent's FD 3 now also reads from offset 1100. This shared-offset behavior is intentional for pipes (parent writes, child reads sequentially) but can be surprising for regular files. To get independent offsets, each process must call open() separately, creating distinct file descriptions.

Q2: Explain the I/O buffering stack in Python's io module. When does data actually reach the disk?

Python's open() constructs a three-layer stack. The bottom layer is FileIO (a RawIOBase) which calls os.read()/os.write() syscalls directly. Above it is BufferedReader/BufferedWriter which maintains a userspace buffer - default 8192 bytes. Writes accumulate in this buffer; a syscall is only issued when the buffer fills, flush() is called, or the file closes. The top layer is TextIOWrapper which handles encoding, decoding, and universal newlines.

Data reaches disk in stages: (1) f.write(data) puts data in the userspace BufferedWriter buffer. (2) f.flush() calls write(2) to move data from the userspace buffer into the kernel's page cache (in RAM). (3) os.fsync(f.fileno()) calls fsync(2), which tells the kernel to flush its page cache for this file to physical storage and waits for the drive to acknowledge. For durable writes (WAL, transaction logs), you need all three calls. If the data only has to survive a crash of the process, not a kernel crash or power loss, flush() is sufficient, because the page cache outlives the process. When losing recent writes is acceptable (bulk import, re-creatable data), you can skip explicit flush() calls, let the buffer drain when it fills or the file closes, and leave writeback to the kernel's own schedule.

Q3: How does mmap achieve zero-copy file access, and what are its limitations?

mmap(2) asks the kernel to create a virtual memory area (VMA) in the process's address space that is backed by a file (or swap space for anonymous mmap). No data is copied at mmap() time - the kernel only records the mapping. When the process first accesses a mapped page (reads a byte within the region), the CPU generates a page fault. The kernel's page fault handler loads the corresponding 4 KB page from the file into the page cache and maps it into the process's address space. Subsequent accesses to the same page are satisfied directly from the page cache without any syscall. The CPU accesses the data as if it were ordinary RAM - which it is.

Limitations: (1) Virtual address space exhaustion - 32-bit processes can mmap at most ~3 GB total; 64-bit processes are effectively bounded by virtual address space for file-backed mappings, while anonymous mappings still consume RAM and swap. (2) A mapping provides no readiness notification - select() and poll() always report regular files as ready, and epoll refuses to register them (EPERM), so none of them can detect when a file has new data; use inotify for that. (3) Accessing a mmap'd region after the file has been truncated below the access point raises SIGBUS, which by default kills the process; it does not surface as a Python exception, so avoid truncating files that are currently mapped. (4) mmap of network filesystems (NFS, FUSE) may cause hangs if the network is slow, because page faults block synchronously. (5) Write performance with MAP_SHARED depends on page writeback timing; for durable writes, call the mmap object's flush() method, which issues msync(2) on the mapping.

Q4: What is sendfile() and how does it eliminate copies that would otherwise occur?

Without sendfile, serving a file over a socket requires: (1) read(file_fd, user_buf, n) - kernel copies n bytes from page cache to userspace buffer; (2) write(sock_fd, user_buf, n) - kernel copies n bytes from userspace buffer to socket send buffer. That is two copies across the user/kernel boundary per chunk (kernel to user, then user back to kernel), plus two syscalls.

sendfile(out_fd, in_fd, offset, count) is a single syscall that transfers data from in_fd (a file) to out_fd (a socket) entirely within the kernel. The file data is brought into the page cache by DMA as usual; when the network card supports scatter-gather DMA, the kernel hands the NIC references to those page cache pages instead of copying the bytes into the socket buffer, so the data never passes through user space. This is called "zero-copy." The practical result is fewer syscalls, fewer context switches, and less CPU time spent copying, which matters most when serving large static files; the exact gain depends on the workload and hardware. Nginx's static file serving performance is often attributed in part to sendfile. Python exposes this via os.sendfile(out_fd, in_fd, offset, count) on Linux and macOS.

Q5: When would you use O_DIRECT and what constraints does it impose?

O_DIRECT bypasses the kernel's page cache - I/O goes directly between the process's userspace buffer and the storage device. This is used by databases that implement their own buffer cache and do not want the OS to double-buffer data (for example MySQL/InnoDB with innodb_flush_method=O_DIRECT, and Oracle; PostgreSQL, by contrast, relies on the OS page cache by default). Without O_DIRECT, a database would maintain its own 2 GB buffer pool AND the OS would cache the same data in the page cache - wasting 2 GB of RAM and causing cache coherency complexity.

The constraints are strict: (1) The file offset and the transfer size must both be multiples of the device's logical block size (typically 512 bytes or 4096 bytes); unaligned I/O with O_DIRECT typically fails with EINVAL. (2) The buffer address must be aligned as well - in C, allocate it with posix_memalign or mmap. In Python, an anonymous mmap.mmap(-1, n) is page-aligned and can be filled with os.readv() or readinto(); a plain bytes or bytearray object gives no alignment guarantee. (3) Not every filesystem supports O_DIRECT; on those that do not, open() itself fails with EINVAL. (4) O_DIRECT does not guarantee durability - writes are sent to the storage controller, not necessarily flushed to non-volatile storage. Combine with O_SYNC or fsync(2) for durability.

© 2026 EngineersOfAI. All rights reserved.